{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# COMPSCI 389: Introduction to Machine Learning\n", "# Topic 5.1 Evaluation Re-Visited\n", "\n", "At the bottom of this notebook, start with the \"Notice\" and \"Answer\" markdown cells collapsed, if possible.\n", "\n", "Recall the following code from before. It does the following:\n", "1. Import relevant libraries\n", "2. Define evaluation metrics\n", "3. Define the KNearestNeighbors model\n", "4. Define the WeightedKNearestNeighbors model" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "from sklearn.neighbors import KDTree\n", "from sklearn.base import BaseEstimator\n", "from sklearn.model_selection import train_test_split\n", "import matplotlib.pyplot as plt\n", "import numpy as np\n", "\n", "def mean_squared_error(predictions, labels):\n", " return np.mean((predictions - labels) ** 2)\n", "\n", "def root_mean_squared_error(predictions, labels):\n", " return np.sqrt(mean_squared_error(predictions, labels))\n", "\n", "def mean_absolute_error(predictions, labels):\n", " return np.mean(np.abs(predictions - labels))\n", "\n", "def r_squared(predictions, labels):\n", " ss_res = np.sum((labels - predictions) ** 2) # ss_res is the \"Sum of Squares of Residuals\"\n", " ss_tot = np.sum((labels - np.mean(labels)) ** 2) # ss_tot is the \"Total Sum of Squares\"\n", " return 1 - (ss_res / ss_tot)\n", "\n", "ts = 0.05\n", "\n", "class KNearestNeighbors(BaseEstimator):\n", " # Add a constructor that stores the value of k (a hyperparameter)\n", " def __init__(self, k=3):\n", " self.k = k\n", "\n", " def fit(self, X, y):\n", " # Convert X and y to NumPy arrays if they are DataFrames\n", " if isinstance(X, pd.DataFrame):\n", " X = X.values\n", " if isinstance(y, pd.Series):\n", " y = y.values\n", "\n", " # Store the training data and labels\n", " self.X_data = X\n", " self.y_data = y\n", " \n", " # Create a KDTree for efficient nearest neighbor search\n", " self.tree = KDTree(X)\n", "\n", " return self\n", "\n", " def predict(self, X):\n", " # Convert X to a NumPy array if it's a DataFrame\n", " if isinstance(X, pd.DataFrame):\n", " X = X.values\n", "\n", " # Query the tree for the k nearest neighbors for all points in X\n", " dist, ind = self.tree.query(X, k=self.k)\n", "\n", " # Return the average label for the nearest neighbors of each query\n", " return np.mean(self.y_data[ind], axis=1)\n", " \n", "class WeightedKNearestNeighbors(BaseEstimator):\n", " # Add a constructor that stores the value of k and sigma (hyperparameters)\n", " def __init__(self, k=3, sigma=1.0):\n", " self.k = k\n", " self.sigma = sigma\n", "\n", " def fit(self, X, y):\n", " # Convert X and y to NumPy arrays if they are DataFrames\n", " if isinstance(X, pd.DataFrame):\n", " X = X.values\n", " if isinstance(y, pd.Series):\n", " y = y.values\n", "\n", " # Store the training data and labels\n", " self.X_data = X\n", " self.y_data = y\n", " \n", " # Create a KDTree for efficient nearest neighbor search\n", " self.tree = KDTree(X)\n", "\n", " return self\n", "\n", " def gaussian_kernel(self, distance):\n", " # Gaussian kernel function\n", " return np.exp(- (distance ** 2) / (2 * self.sigma ** 2))\n", "\n", " def predict(self, X):\n", " # Convert X to a NumPy array if it's a DataFrame\n", " if isinstance(X, pd.DataFrame):\n", " X = X.values\n", "\n", " # We will iteratively load predictions, so it starts empty\n", " predictions = []\n", " \n", " # Loop over rows in the query\n", " for x in X:\n", " 
"            # Query the tree for the k nearest neighbors\n", "            dist, ind = self.tree.query([x], k=self.k)\n", "\n", "            # Calculate weights using the Gaussian kernel\n", "            weights = self.gaussian_kernel(dist[0])\n", "\n", "            # Check whether the weights sum to zero. This happens when all neighbors\n", "            # are so far away that their weights round to zero, which would cause a\n", "            # division by zero below. In that case, revert to un-weighted k-NN\n", "            # (all weights equal to one).\n", "            if np.sum(weights) == 0:\n", "                weights = np.ones_like(weights)\n", "\n", "            # Weighted average of the labels of the k nearest neighbors\n", "            weighted_avg_label = np.average(self.y_data[ind[0]], weights=weights)\n", "            predictions.append(weighted_avg_label)\n", "\n", "        # Return the array of predictions we have created\n", "        return np.array(predictions)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, let's define a function `runTrial` that:\n", "1. Loads the GPA data set\n", "2. Splits it into train and test sets\n", "3. Trains different variants of nearest neighbors on the training data\n", "4. Evaluates the models using the testing data\n", "5. Reports the results" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "# Bold the best value in each metric column of the results DataFrame\n", "def highlight_best(row, best_metrics):\n", "    return ['font-weight: bold' if (col in best_metrics and row.name == best_metrics[col]) else '' for col in row.index]\n", "\n", "def runTrial():\n", "    # Load the data set (read GPA.csv, assuming numbers are separated by commas)\n", "    df = pd.read_csv(\"https://people.cs.umass.edu/~pthomas/courses/COMPSCI_389_Spring2024/GPA.csv\", delimiter=',')\n", "    #df = pd.read_csv(\"data/GPA.csv\", delimiter=',')  # Alternative: load a local copy\n", "\n", "    # Separate the features (X) from the labels (y)\n", "    X = df.iloc[:, :-1]\n", "    y = df.iloc[:, -1]\n", "\n", "    # Split the data into training and testing sets\n", "    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=ts, shuffle=True)\n", "\n", "    # Model parameters to test\n", "    parameters = [\n", "        {\"k\": 1, \"sigma\": None},    # Standard NN\n", "        {\"k\": 100, \"sigma\": None},  # Standard k-NN\n", "        {\"k\": 110, \"sigma\": 90}     # Weighted k-NN\n", "    ]\n", "\n", "    # List to store results\n", "    results = []\n", "\n", "    # Training and evaluating each model\n", "    for param in parameters:\n", "        if param[\"sigma\"] is None:\n", "            model = KNearestNeighbors(k=param[\"k\"])\n", "        else:\n", "            model = WeightedKNearestNeighbors(k=param[\"k\"], sigma=param[\"sigma\"])\n", "        model.fit(X_train, y_train)\n", "        predictions = model.predict(X_test)\n", "\n", "        mse = mean_squared_error(predictions, y_test)\n", "        rmse = root_mean_squared_error(predictions, y_test)\n", "        mae = mean_absolute_error(predictions, y_test)\n", "        r2 = r_squared(predictions, y_test)\n", "\n", "        results.append({\"Model\": f\"k-NN k={param['k']} sigma={param['sigma']}\",\n", "                        \"MSE\": mse, \"RMSE\": rmse, \"MAE\": mae, \"R^2\": r2})\n", "\n", "    # Creating a DataFrame for the results\n", "    results_df = pd.DataFrame(results)\n", "\n", "    # Find the row index of the best (minimum or maximum) value for each metric\n", "    best_metrics = {\n", "        \"MSE\": results_df['MSE'].idxmin(),\n", "        \"RMSE\": results_df['RMSE'].idxmin(),\n", "        \"MAE\": results_df['MAE'].idxmin(),\n", "        \"R^2\": results_df['R^2'].idxmax()\n", "    }\n", "\n", "    # Apply the highlighting\n", "    styled_results = results_df.style.apply(highlight_best, best_metrics=best_metrics, axis=1)\n", "    display(styled_results)" ]
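}, { "cell_type": "markdown", "metadata": {}, "source": [ "Before running the trial, here is a minimal sketch (an addition, not part of the original lesson) of what `runTrial` does for a single model on a single split. It assumes the GPA data and the `KNearestNeighbors` class defined above. Note that passing a `random_state` to `train_test_split` pins the shuffle, so this cell produces the same numbers every run; `runTrial` omits it, so each call evaluates on a different random split." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# A minimal sketch (not from the original lesson): evaluate one model on one split.\n", "# Passing random_state pins the shuffle, so re-running gives identical results.\n", "df = pd.read_csv(\"https://people.cs.umass.edu/~pthomas/courses/COMPSCI_389_Spring2024/GPA.csv\", delimiter=',')\n", "X = df.iloc[:, :-1]\n", "y = df.iloc[:, -1]\n", "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=ts, shuffle=True, random_state=0)\n", "model = KNearestNeighbors(k=100).fit(X_train, y_train)\n", "predictions = model.predict(X_test)\n", "print(\"RMSE:\", root_mean_squared_error(predictions, y_test))" ]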
}, { "cell_type": "markdown", "metadata": {}, "source": [ "Run the next cell several times:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
 ModelMSERMSEMAER^2
0k-NN k=1 sigma=None1.0437781.0216550.786480-0.585888
1k-NN k=100 sigma=None0.5444470.7378670.5732760.172782
2k-NN k=110 sigma=900.5448620.7381480.5736760.172152
\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "runTrial()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Notice**\n", "We cannot trust the evaluations of which is better! It often flips when we re-run the code.\n", "\n", "Yes, this is partially because we used a very small test set (5% of the data).\n", "\n", "**Question**: Can this happen when you use a larger portion of the data set (say, 50%)?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Answer**: \n", "\n", "Yes! Particularly if the performances are very similar or if there is a small total amount of data.\n", "\n", "To address this, we need to delve a little into probability and statistics." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.7" } }, "nbformat": 4, "nbformat_minor": 2 }